Semi-supervised Learning with Regularized Laplacian

Abstract: We study a semi-supervised learning method based on the similarity graph and the
Regularized Laplacian. We formalize the method as a convex quadratic optimization problem and
establish its various properties. In particular, we show that the kernel of the method can be
interpreted in terms of discrete and continuous time random walks and possesses several important
properties of proximity measures. Both optimization and linear algebra techniques can be used for
efficient computation of the classification functions. We demonstrate on numerical examples that
the Regularized Laplacian method is competitive with the other state of the art semi-supervised
learning methods.

Keywords: Semi-supervised learning, Graph-based learning, Regularized Laplacian, Proximity
measure, Wikipedia article classification
1 Introduction

Graph-based semi-supervised learning methods rest on three principles. The first principle is to
use a few labelled points (points with known classification) together with the unlabelled data to
tune the classifier. In contrast with supervised machine learning, semi-supervised learning creates
a synergy between the training data and the classification data. This drastically reduces the size
of the training set and hence significantly reduces the cost of experts' work. The second principle
is to use a (weighted) similarity graph. If two data points are connected by an edge, this indicates
some similarity between these points, and the weight of the edge, if present, reflects the degree of
similarity. The result of classification is given in the form of classification functions. Each class
has its own classification function defined over all data points. An element of a classification
function gives the degree of relevance of the corresponding data point to the class. The third
principle is that the classification function should change smoothly over the similarity graph.
Intuitively, nodes of the similarity graph that are closer together in some sense are more likely to
belong to the same class. This idea of classification function smoothness can naturally be expressed
using the graph Laplacian or its modifications.

The work [37] seems to be the first work where graph-based semi-supervised learning was
introduced. The authors of [37] formulated the semi-supervised learning method as a constrained
optimization problem involving the graph Laplacian. Then, in [36, 35] the authors proposed
optimization formulations based on several variations of the graph Laplacian. In [3] a unifying
optimization framework was proposed which gives as particular cases the methods of [35] and
[36]. In addition, the general framework in [3] gives as a particular case an interesting PageRank
based method, which provides robust classification with respect to the choice of the labelled
points HIS]. We would like to note that the local graph partitioning problem diin] can be related
to graph-based semi-supervised learning. An interested reader can find more details about
various semi-supervised learning methods in the surveys and books El 123 EH].

In the present work we study in detail a semi-supervised learning method based on the
Regularized Laplacian. To the best of our knowledge, the idea of using the Regularized Laplacian
and its kernel for measuring proximity in graphs, with application to mathematical sociology, goes
back to the works lHIISj. In [23] the authors compared experimentally many graph-based
semi-supervised learning methods on several datasets and concluded that the method based on the
Regularized Laplacian kernel demonstrates one of the best performances on nearly all datasets.
In [S] the authors studied a semi-supervised learning method based on the Normalized Laplacian
graph kernel, which also shows good performance. Interestingly, as we show below, if we choose the
Markovian Laplacian as a weight matrix, several known semi-supervised learning methods reduce to
the Regularized Laplacian method. In this work we formulate the Regularized Laplacian method as a
convex quadratic optimization problem, which helps to design easily parallelizable numerical
methods. In fact, the Regularized Laplacian method can be regarded as a Lagrangian relaxation of
the method proposed in [37]. Of course, this is a more flexible formulation, since by choosing an
appropriate value for the Lagrange multiplier one can always retrieve the method of [37] as a
particular case. We establish various properties of the Regularized Laplacian method. In
particular, we show that the kernel of the method can be interpreted in terms of discrete and
continuous time random walks and possesses several important properties of proximity measures.
Both optimization and linear algebra methods can be used for efficient computation of the
classification functions. We discuss advantages and disadvantages of various numerical approaches.
We demonstrate on numerical examples that the Regularized Laplacian method is competitive with
the other state of the art semi-supervised learning methods.

The paper is organized as follows. In the next section we formally define the Regularized
Laplacian method. In Section 3 we discuss several related graph-based semi-supervised methods
and graph kernels. In Section 4 we present interpretations and properties of the Regularized
Laplacian method. We analyse important limiting cases in Section 5. Then, in Section 6 we discuss
various numerical approaches to compute the classification functions and show by numerical
examples that the performance of the Regularized Laplacian method is better than or comparable
with that of the leading semi-supervised methods. Section 7 concludes the paper with directions
for future research.


2 Notations and method formulation 


Suppose one needs to classify $N$ data points (nodes) into $K$ classes and assume that $P$ data
points are labelled, that is, we know the class to which each labelled point belongs. Denote by
$V_k$ the set of labelled points in class $k = 1, \dots, K$. Of course, $|V_1| + \dots + |V_K| = P$.

The graph-based semi-supervised learning approach uses a weighted graph $G = (V, A)$ connecting
data points, where $V$, $|V| = N$, denotes the set of nodes and $A$ denotes the weight
(similarity) matrix. In this work we assume that $A$ is symmetric and the underlying graph is
connected. Each element $a_{ij}$ represents the degree of similarity between data points $i$ and
$j$. Denote by $D$ the diagonal matrix with its $(i, i)$-element equal to the sum of the $i$-th row
of matrix $A$: $d_i = \sum_{j=1}^{N} a_{ij}$, and denote by $L = D - A$ the Standard
(Combinatorial) Laplacian associated with the graph $G$.

Define an $N \times K$ matrix $Y$ as

$$Y_{ik} = \begin{cases} 1, & \text{if } i \in V_k, \text{ i.e., point } i \text{ is labelled as a class } k \text{ point}, \\ 0, & \text{otherwise}. \end{cases}$$

We refer to each column $Y_{*k}$ of matrix $Y$ as a labeling function. Also define an $N \times K$
matrix $F$ and call its columns $F_{*k}$ classification functions. The general idea of graph-based
semi-supervised learning is to find classification functions that, on the one hand, are close to
the corresponding labeling functions and, on the other hand, change smoothly over the graph
associated with the similarity matrix. This general idea can be expressed by means of the
following particular optimization problem:


$$\min_{F} \left\{ \sum_{k=1}^{K} (F_{*k} - Y_{*k})^T (F_{*k} - Y_{*k}) + \beta \sum_{k=1}^{K} F_{*k}^T L F_{*k} \right\}, \tag{1}$$

where $\beta \in (0, \infty)$ is a regularization parameter. The regularization parameter $\beta$
represents a trade-off between the closeness of the classification function to the labeling
function and its smoothness.

Since the Laplacian $L$ is positive semidefinite, the second term in (1) is convex; the first term
is strictly convex, so the optimization problem (1) has a unique solution determined by the
stationarity condition

$$2 (F_{*k} - Y_{*k})^T + 2 \beta F_{*k}^T L = 0, \quad k = 1, \dots, K,$$

which gives

$$F_{*k} = (I + \beta L)^{-1} Y_{*k}, \quad k = 1, \dots, K. \tag{2}$$


The matrix $Q_\beta = (I + \beta L)^{-1}$ is known as the Regularized Laplacian kernel of the graph
[28, 33] and can be related to the matrix forest theorems [HII] and stochastic matrices [T]. The
classification functions $F_{*k}$, $k = 1, \dots, K$, can be obtained either by numerical linear
algebra methods (e.g., power iterations) applied to (2) or by numerical optimization methods
applied to (1). We elaborate on numerical methods in Section 6. Once the classification functions
are obtained, the points are classified according to the rule

$$F_{ik} > F_{ik'}, \ \forall k' \neq k \quad \Rightarrow \quad \text{point } i \text{ is classified into class } k.$$

Ties can be broken in an arbitrary fashion.
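
As an illustration of (2) and of the classification rule, the following Python sketch solves
$(I + \beta L) F = Y$ by Gaussian elimination on a small toy graph. The graph, the labelled nodes
and the value $\beta = 1$ are assumptions made only for this example, and the helper names
(`laplacian`, `solve`, `classify`) are hypothetical; for large graphs one would rather use the
solvers discussed in Section 6.

```python
def laplacian(A):
    """Standard (combinatorial) Laplacian L = D - A of a symmetric weight matrix."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]


def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [float(x) for x in B[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def classify(A, labels, n_classes, beta):
    """Return (F, predicted classes) for F = (I + beta L)^{-1} Y."""
    n = len(A)
    L = laplacian(A)
    M = [[(1.0 if i == j else 0.0) + beta * L[i][j] for j in range(n)] for i in range(n)]
    # Y[i][k] = 1 if node i is labelled as class k, and 0 otherwise.
    Y = [[1.0 if labels.get(i) == k else 0.0 for k in range(n_classes)] for i in range(n)]
    F = solve(M, Y)
    # Each node goes to the class with the largest classification function value.
    return F, [max(range(n_classes), key=lambda k: F[i][k]) for i in range(n)]


# Toy graph (an assumption): two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0

# Node 0 is labelled as class 0 and node 5 as class 1.
F, predicted = classify(A, labels={0: 0, 5: 1}, n_classes=2, beta=1.0)
print(predicted)
```

With one labelled node in each triangle, every node is assigned to the class of the labelled node
in its own triangle.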


3 Related approaches 

Let us discuss a number of related approaches. First, we discuss formal relations and in the 
numerical examples section we compare the approaches on some benchmark examples. 

3.1 Relation to heat kernels 

The authors of HZlllH] first introduced and studied the properties of the heat kernel based on
the normalized Laplacian. Specifically, they introduced the kernel

$$\mathcal{H}(t) = \exp(-t \mathcal{L}), \tag{3}$$

where

$$\mathcal{L} = D^{-1/2} L D^{-1/2}$$

is the normalized Laplacian. Let us refer to $\mathcal{H}(t)$ as the normalized heat kernel. Note
that the normalized heat kernel can be obtained as a solution of the following differential equation

$$\dot{\mathcal{H}}(t) = -\mathcal{L} \mathcal{H}(t),$$

with the initial condition $\mathcal{H}(0) = I$. Then, in (TH] the PageRank heat kernel was introduced

$$\Pi(t) = \exp(-t(I - P)), \tag{4}$$

where

$$P = D^{-1} A \tag{5}$$

is the transition probability matrix of the standard random walk on the graph. In [20] the
PageRank heat kernel was applied to local graph partitioning.

In [55] the heat kernel based on the standard Laplacian

$$H(t) = \exp(-tL), \tag{6}$$

with $L = D - A$, was proposed as a kernel in the support vector machine learning method.
Then, in [37] the authors proposed a semi-supervised learning method based on the solution of
a heat diffusion equation with Dirichlet boundary conditions. Equivalently, the method of [37]
can be viewed as the minimization of the second term in (1) with the values of the classification
functions fixed on the labelled points. Thus, the proposed approach (1) is more general, as it can
be viewed as a Lagrangian relaxation of [37]. The results of the method in [37] can be retrieved
with a particular choice of the regularization parameter.
3.2 Relation to the generalized semi-supervised learning method

In [5] the authors proposed a generalized optimization framework for graph-based semi-supervised
learning methods

$$\min_{F} \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \left\| (d'_i)^{\sigma - 1} F_{i*} - (d'_j)^{\sigma - 1} F_{j*} \right\|^2 + \mu \sum_{i=1}^{N} (d'_i)^{2\sigma - 1} \left\| F_{i*} - Y_{i*} \right\|^2 \right\}, \tag{7}$$

where $w_{ij}$ are the entries of a weight matrix $W = (w_{ij})$ which is a function of $A$ (in
particular, one can also take $W = A$), and $d'_i$ are the diagonal entries of the generalized
degree matrix $D' = \mathrm{diag}(W \mathbf{1})$.

In particular, with $\sigma = 1$ we retrieve the transductive semi-supervised learning method [35],
with $\sigma = 1/2$ we retrieve the semi-supervised learning with local and global consistency [36]
and with $\sigma = 0$ we retrieve the PageRank based method [3].

The classification functions of the generalized graph-based semi-supervised learning are given by

$$F_{*k} = \frac{\mu}{2 + \mu} \left( I - \frac{2}{2 + \mu} D'^{-\sigma} W D'^{\sigma - 1} \right)^{-1} Y_{*k}, \quad k = 1, \dots, K.$$

Now taking as the weight matrix $W = I - \tau L = I - \tau (D - A)$ (note that with this choice of
the weight matrix, the generalized degree matrix $D' = \mathrm{diag}(W \mathbf{1})$ becomes the
identity matrix), the above equation transforms to

$$F_{*k} = \left( I + \frac{2\tau}{\mu} L \right)^{-1} Y_{*k}, \quad k = 1, \dots, K,$$

which is (2) with $\beta = 2\tau/\mu$. It is very interesting to observe that with the proposed
choice of the weight matrix all the semi-supervised learning methods defined by various $\sigma$'s
coincide.
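
For completeness, the reduction can be written out step by step. The derivation below is a sketch
under the assumption that the classification functions of the generalized method take the form
$F_{*k} = \frac{\mu}{2+\mu} \left( I - \frac{2}{2+\mu} D'^{-\sigma} W D'^{\sigma-1} \right)^{-1} Y_{*k}$
with $D' = \mathrm{diag}(W \mathbf{1})$. Since $L \mathbf{1} = 0$, the choice $W = I - \tau L$ gives
$D' = I$, so the dependence on $\sigma$ disappears:

```latex
\begin{align*}
F_{*k} &= \frac{\mu}{2+\mu} \left( I - \frac{2}{2+\mu} (I - \tau L) \right)^{-1} Y_{*k}
        = \frac{\mu}{2+\mu} \left( \frac{\mu}{2+\mu} I + \frac{2\tau}{2+\mu} L \right)^{-1} Y_{*k} \\
       &= \left( I + \frac{2\tau}{\mu} L \right)^{-1} Y_{*k}
        = (I + \beta L)^{-1} Y_{*k}, \qquad \beta = \frac{2\tau}{\mu}.
\end{align*}
```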


4 Properties and interpretations of the Regularized Laplacian method

There are a number of interesting interpretations and characterizations which we can provide for
the classification functions (2). These interpretations and characterizations give different
insights into the Regularized Laplacian kernel $Q_\beta$ and the classification functions (2).

4.1 Discrete-time random walk interpretation 

The Regularized Laplacian kernel $Q_\beta = (I + \beta L)^{-1}$ can be interpreted as the overall
transition matrix of a random walk on the similarity graph $G$ with a geometrically distributed
number of steps. Namely, consider a Markov chain whose states are our data points and the
probabilities of transitions between distinct states are proportional to the corresponding entries
of the similarity matrix $A$:

$$p_{ij} = \tau a_{ij}, \quad i, j = 1, \dots, N, \ i \neq j, \tag{8}$$

where $\tau > 0$ is a sufficiently small parameter. Then the diagonal elements of the transition
matrix $P = (p_{ij})$ are

$$p_{ii} = 1 - \sum_{j \neq i} \tau a_{ij}, \quad i = 1, \dots, N, \tag{9}$$

or, in the matrix form,

$$P = I - \tau L. \tag{10}$$

The matrix $P$ determines a random walk on $G$ which differs from the “standard” one defined by
(5) and related to the PageRank heat kernel (4). As distinct from (5), the transition matrix (10)
is symmetric for every undirected graph; in general, it has a nonzero diagonal. It is interesting
to observe that $P$ coincides with the weight matrix $W$ used for the transformation of
Subsection 3.2.

Consider a sequence of independent Bernoulli trials indexed by $0, 1, 2, \dots$ with a certain
success probability $q$. Assume that the number of steps, $K$, in a random walk is equal to the
trial number of the first success, and let $X_k$ be the state of the Markov chain at step $k$.
Then $K$ is distributed geometrically:

$$\Pr\{K = k\} = q (1 - q)^k, \quad k = 0, 1, 2, \dots,$$

and the transition matrix of the overall random walk after a random number of steps $K$,
$Z = (z_{ij})$, $z_{ij} = \Pr\{X_K = j \mid X_0 = i\}$, $i, j = 1, \dots, N$, is given by

$$Z = q \sum_{k=0}^{\infty} (1 - q)^k P^k = q \left( I - (1 - q) P \right)^{-1} = q \left( I - (1 - q)(I - \tau L) \right)^{-1} = \left( I + \tau (q^{-1} - 1) L \right)^{-1}.$$

Thus, $Z = Q_\beta = (I + \beta L)^{-1}$ with $\beta = \tau (q^{-1} - 1)$.

This means that the $i$-th component of the classification function can be interpreted, up to the
normalizing factor $\mathbf{1}^T Y_{*k}$, as the probability of finding the discrete-time random
walk with transition matrix (10) in node $i$ after the geometrically distributed number of steps
with parameter $q$, given that the random walk started with the distribution
$Y_{*k} / (\mathbf{1}^T Y_{*k})$.
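
The identity $Z = Q_\beta$ can be checked numerically. The sketch below, in plain Python,
accumulates the truncated series $q \sum_{k} (1-q)^k P^k$ for $P = I - \tau L$ on a small path
graph and verifies that $(I + \beta L) Z \approx I$ with $\beta = \tau (q^{-1} - 1)$. The graph,
the values of $\tau$ and $q$, the truncation length and the helper names are assumptions chosen
for illustration only.

```python
def matmul(X, Y):
    """Product of two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Path graph 0 - 1 - 2 - 3 with unit weights (an assumption for this example).
n = 4
A = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
L = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]

tau, q = 0.25, 0.2  # tau is small enough for P = I - tau L to have a nonnegative diagonal
beta = tau * (1.0 / q - 1.0)
I = identity(n)
P = [[I[i][j] - tau * L[i][j] for j in range(n)] for i in range(n)]

# Z = q * sum_{k >= 0} (1 - q)^k P^k, truncated after 400 terms (the tail is negligible).
Z = [[0.0] * n for _ in range(n)]
term = [[q * x for x in row] for row in I]  # k = 0 term: q * I
for _ in range(400):
    Z = [[z + t for z, t in zip(z_row, t_row)] for z_row, t_row in zip(Z, term)]
    term = [[(1.0 - q) * x for x in row] for row in matmul(term, P)]

# Residual of (I + beta L) Z = I; it should be at rounding-error level.
M = [[I[i][j] + beta * L[i][j] for j in range(n)] for i in range(n)]
R = matmul(M, Z)
residual = max(abs(R[i][j] - I[i][j]) for i in range(n) for j in range(n))
print(beta, residual)
```

Checking the residual rather than inverting $I + \beta L$ keeps the example free of any linear
solver; the rows of $Z$ also sum to one, as expected for a transition matrix.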

4.2 Continuous-time random walk interpretation 

Consider the differential equation

$$\dot{H}(t) = -L H(t), \tag{11}$$

with the initial condition $H(0) = I$. Also consider the standard continuous-time random walk
that spends exponentially distributed time in node $k$ with the expected duration $1/d_k$ and after
the exponentially distributed time moves to a new node $l$ with probability $a_{kl}/d_k$. Then the
entry $h_{ij}(t)$ of the solution $H(t) = \exp(-tL)$ of the differential equation (11) can be
interpreted as the probability of finding the standard continuous-time random walk in node $j$ at
time $t$, given that the random walk started from node $i$. By taking the Laplace transform of (11)
we obtain

$$H(s) = (sI + L)^{-1} = s^{-1} (I + s^{-1} L)^{-1}. \tag{12}$$

Thus, the classification function (2) can be interpreted as the Laplace transform divided by
$1/s$, or equivalently the $i$-th component of the classification function can be interpreted as a
quantity proportional to the probability of finding the random walk in node $i$ after an
exponentially distributed time with mean $\beta = 1/s$, given that the random walk started with the
distribution $Y_{*k} / (\mathbf{1}^T Y_{*k})$.
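
The link between the Laplace transform and the exponentially distributed observation time can be
made explicit. The following short derivation is a sketch under the assumption that the
observation time $T$ is independent of the random walk and exponentially distributed with rate $s$
(mean $\beta = 1/s$); the integral converges because $sI + L$ is positive definite for $s > 0$:

```latex
\begin{align*}
\mathbb{E}\, \exp(-TL) &= \int_0^\infty s e^{-st} \exp(-tL)\, dt
                        = s \int_0^\infty \exp\bigl(-t (sI + L)\bigr)\, dt \\
                       &= s (sI + L)^{-1} = (I + s^{-1} L)^{-1} = (I + \beta L)^{-1} = Q_\beta .
\end{align*}
```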
4.3 Proximity and distance properties


As before, let $Q_\beta = (q^\beta_{ij})_{N \times N}$ be the Regularized Laplacian kernel
$(I + \beta L)^{-1}$ of (2).

$Q_\beta$ determines a positive 1-proximity measure [14] $s(i, j) := q^\beta_{ij}$, i.e., it
satisfies [14] the following conditions:

(1) for any $i \in V$, $\sum_{k \in V} q^\beta_{ik} = 1$ and
(2) for any $i, j, k \in V$, $q^\beta_{ji} + q^\beta_{jk} - q^\beta_{ik} \le q^\beta_{jj}$ with a
strict inequality whenever $i = k$ and $i \neq j$ (the triangle inequality for proximities).

This implies m the following two important properties: (a) $q^\beta_{ii} > q^\beta_{ij}$ for all
$i, j \in V$ such that $i \neq j$ (egocentrism property); (b)
$\rho^\beta_{ij} := \beta (q^\beta_{ii} + q^\beta_{jj} - q^\beta_{ij} - q^\beta_{ji})$ is a distance
on $V$. Because of the forest interpretation of $Q_\beta$ (see Section 4.4),^1 it is called the
adjusted forest distance. The distances $\rho^\beta_{ij}$ have a twofold connection with the
resistance distance $\tilde{\rho}_{ij}$ on $G$ [16]. First,
$\lim_{\beta \to \infty} \rho^\beta_{ij} = \tilde{\rho}_{ij}$, $i, j \in V$. Second, let $G^\beta$
be the weighted graph such that: $V(G^\beta) = V(G) \cup \{0\}$, the restriction of $G^\beta$ to
$V(G)$ coincides with $G$, and $G^\beta$ additionally contains an edge $(i, 0)$ of weight
$1/\beta$ for each node $i \in V(G)$. Then it follows that
$\rho^\beta_{ij}(G) = \tilde{\rho}_{ij}(G^\beta)$, $i, j \in V$. In the electrical interpretation of
$G$, the weight $1/\beta$ of the edges $(i, 0)$ is treated as conductivity, i.e., the lines
connecting each node to the “hub” 0 have resistance $\beta$. An interested reader can find more
properties of the proximity measures determined by $Q_\beta$ in m.

Furthermore, every $Q_\beta$, $\beta > 0$, determines a transitional measure on $V$, which means
[12] that $q^\beta_{ij} q^\beta_{jk} \le q^\beta_{ik} q^\beta_{jj}$ for all $i, j, k \in V$, with
$q^\beta_{ij} q^\beta_{jk} = q^\beta_{ik} q^\beta_{jj}$ if and only if every path in $G$ from $i$
to $k$ visits $j$.

It follows that $d^\beta_{ij} := -\ln \left( q^\beta_{ij} / \sqrt{q^\beta_{ii} q^\beta_{jj}} \right)$
provides a distance on $V$. This distance is cutpoint additive, that is,
$d^\beta_{ij} + d^\beta_{jk} = d^\beta_{ik}$ if and only if every path in $G$ from $i$ to $k$ visits
$j$. In the asymptotics, $d^\beta$ becomes proportional to the shortest path distance and the
resistance distance as $\beta \to 0$ and $\beta \to \infty$, respectively.
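
The properties listed in this subsection can be verified numerically on a small example. The
following Python sketch builds $Q_\beta$ for a toy graph and checks that its rows sum to one, that
the egocentrism property $q_{ii} > q_{ij}$ holds, and that the adjusted forest distance
$\beta (q_{ii} + q_{jj} - q_{ij} - q_{ji})$ satisfies the ordinary triangle inequality. The graph,
the value $\beta = 2$ and the helper `inverse` are assumptions of the example.

```python
from itertools import permutations


def inverse(M):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


# Toy weighted graph (an assumption): a 4-cycle 0-1-2-3 with a chord (0, 2) of weight 2.
n, beta = 4, 2.0
A = [[0.0] * n for _ in range(n)]
for i, j, w in [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0), (0, 2, 2.0)]:
    A[i][j] = A[j][i] = w
L = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]
Q = inverse([[(1.0 if i == j else 0.0) + beta * L[i][j] for j in range(n)] for i in range(n)])

# Condition (1): each row of Q sums to one.
row_sums = [sum(row) for row in Q]

# Property (a), egocentrism: q_ii > q_ij for all i != j.
egocentric = all(Q[i][i] > Q[i][j] for i in range(n) for j in range(n) if i != j)

# Property (b): the adjusted forest distance obeys the triangle inequality.
rho = [[beta * (Q[i][i] + Q[j][j] - Q[i][j] - Q[j][i]) for j in range(n)] for i in range(n)]
triangle = all(rho[i][k] <= rho[i][j] + rho[j][k] + 1e-12
               for i, j, k in permutations(range(n), 3))

print(row_sums, egocentric, triangle)
```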


4.4 Matrix forest characterization 

By the matrix forest theorem [HII], each entry $q^\beta_{ij}$ of $Q_\beta$ is equal to the specific
weight of the spanning rooted forests that connect node $i$ to node $j$ in the weighted graph $G$
whose combinatorial Laplacian is $L$.

More specifically, $q^\beta_{ij} = \mathcal{F}^\beta_{ij} / \mathcal{F}^\beta$, where
$\mathcal{F}^\beta$ is the total $\beta$-weight of all spanning rooted forests of $G$,
$\mathcal{F}^\beta_{ij}$ being the total $\beta$-weight of such of them that have node $i$ in a
tree rooted at $j$. Here, the $\beta$-weight of a forest stands for the product of its edge
weights, each multiplied by $\beta$.

Let us mention a closely related interpretation of the Regularized Laplacian kernel $Q_\beta$ in
terms of information dissemination m. Suppose that an information unit (an idea) must be
transmitted through $G$. A plan of information transmission is a spanning rooted forest
$\mathcal{F}$ in $G$: the information unit is initially injected into the roots of $\mathcal{F}$;
after that it comes to the other nodes along the edges of $\mathcal{F}$. Suppose that a plan is
chosen at random: the probability of every choice is proportional to the $\beta$-weight of the
corresponding forest. Then by the matrix forest theorem, the probability that the information unit
arrives at $i$ from root $j$ equals $q^\beta_{ij} = \mathcal{F}^\beta_{ij} / \mathcal{F}^\beta$.
This interpretation is particularly helpful in the context of machine learning for social networks.


^1 Cf. the cosine law m and the inverse covariance mapping [22, Section 5.2].
4.5 Statistical characterization

Consider the problem of attribute evaluation from paired comparisons.

Suppose that each data point (node) $i$ has a value parameter $v_i$, and a series of paired
comparisons between the points is performed. Let the result $r_{ij}$ of $i$ in a comparison with
$j$ obey the Scheffé linear statistical model m

$$E(r_{ij}) = v_i - v_j, \tag{13}$$

where $E(\cdot)$ is the mathematical expectation. The matrix form of (13) applied to an experiment
is

$$E(r) = X v,$$

where $v = (v_1, \dots, v_N)^T$, and $r$ is the vector of comparison results, $X$ being the
incidence matrix (design matrix, in terms of statistics): if the $k$-th element of $r$ is a
comparison result of $i$ confronted to $j$, then, in accordance with (13), $x_{ki} = 1$,
$x_{kj} = -1$, and $x_{kl} = 0$ for $l \notin \{i, j\}$.

Suppose that $X$ is known, $r$ being a sample, and the problem is to estimate $v$ up to a shift
[TUI Section 4]. Then

$$\hat{v}(\lambda) = (\lambda I + X^T X)^{-1} X^T r \tag{14}$$

is the well-known ridge estimate of $v$, where $\lambda > 0$ is the ridge parameter. Denoting
$\beta = \lambda^{-1}$ and $X^T X = L$ (it is easily verified that $X^T X$ is a Laplacian matrix
whose $(i, j)$-entry with $j \neq i$ is minus the number of comparisons between $i$ and $j$) one has

$$\hat{v}(\lambda) = (I + \beta L)^{-1} \beta X^T r, \tag{15}$$

i.e., the solution is provided by the same transformation based on the Regularized Laplacian
kernel as in (2) (cf. also (flUll). Here, the weight matrix $A$ of $G$ contains the numbers of
comparisons between nodes; $s = X^T r$ is the vector of the sums of comparison results of the
nodes: $s_i = \sum_j r_{ij} - \sum_j r_{ji}$, where $r_{ij}$ and $r_{ji}$ are taken from $r$, which
has one entry (either $r_{ij}$ or $r_{ji}$) for each comparison result.
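
To make the correspondence between (14) and (15) concrete, the sketch below builds the design
matrix $X$ for a handful of hypothetical paired comparisons, checks that $X^T X$ is the Laplacian
of the comparison multigraph, and confirms that the ridge estimate coincides with
$(I + \beta L)^{-1} \beta X^T r$ for $\beta = 1/\lambda$. The comparison data, the value of
$\lambda$ and the helper names are invented for illustration only.

```python
def inverse(M):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def matmul(X, Y):
    """Product of two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Hypothetical paired comparisons (i, j, result of i against j) on 3 nodes.
comparisons = [(0, 1, 1.0), (0, 1, 0.5), (1, 2, 2.0), (0, 2, 1.5)]
n, lam = 3, 0.5
beta = 1.0 / lam

# Design matrix: row k has +1 in column i and -1 in column j.
X = [[1.0 if c == i else -1.0 if c == j else 0.0 for c in range(n)]
     for i, j, _ in comparisons]
r = [[res] for _, _, res in comparisons]

Xt = [list(col) for col in zip(*X)]
L = matmul(Xt, X)  # Laplacian of the comparison multigraph
s = matmul(Xt, r)  # sums of comparison results of the nodes

I = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
# Ridge estimate (14): (lambda I + X^T X)^{-1} X^T r.
ridge = matmul(inverse([[lam * I[a][b] + L[a][b] for b in range(n)] for a in range(n)]), s)
# Regularized Laplacian form (15): (I + beta L)^{-1} beta X^T r.
kernel = matmul(inverse([[I[a][b] + beta * L[a][b] for b in range(n)] for a in range(n)]),
                [[beta * x[0]] for x in s])

print(L)
print([round(v[0], 6) for v in ridge], [round(v[0], 6) for v in kernel])
```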

Suppose now that the value parameter $v_i$ (belonging to an interval centered at zero) is a
positive or negative intensity of some property, and thus $v_i$ can be treated as a signed
membership of data point $i$ in the corresponding class. The pairwise comparisons $r$ are performed
with respect to this property. Then $\beta X^T r = \beta s$ is a kind of labeling function or a
crude correlate of membership in the above class, whereas (15) provides a refined measure of
membership which takes into account proximity. Along these lines, (15) can be considered as a
procedure of semi-supervised learning.

A Bayesian version of the model (13) enables one to interpret and estimate the ridge parameter
$\lambda = 1/\beta$. Namely, assume that:

(i) the parameters $v_1, \dots, v_N$ chosen at random from the universal set are independent random
variables with zero mean and variance $\sigma_1^2$ and

(ii) for any vector $v$, the errors in (13) are independent and have zero mean, their unconditional
variance being $\sigma_2^2$.

It can be shown nni Proposition 4.2] that under these conditions, the best linear predictors
for the parameters $v$ are the ridge estimators (15) with $\beta = \sigma_1^2 / \sigma_2^2$.

The best linear predictors for $v$ are the $\hat{v}_i$'s that minimize $E(\hat{v}_i - v_i)^2$ among
all statistics of the form $\hat{v}_i = c_i + C_i^T r$ satisfying $E(\hat{v}_i - v_i) = 0$.

The variances $\sigma_1^2$ and $\sigma_2^2$ can be estimated from the experiment. In fact, there
are many approaches to choosing the ridge parameter.
5 Limiting cases

Let us analyse the formula (2) in two limiting cases: $\beta \to 0$ and $\beta \to \infty$. If
$\beta \to 0$, we have

$$F_{*k} = (I - \beta L) Y_{*k} + o(\beta).$$

Thus, for very small values of $\beta$, the method resembles the nearest neighbour method with the
weight matrix $W = I - \beta L$. If there are many points situated more than one hop away from any
labelled point, the method cannot produce good classification with very small values of $\beta$.
This will be illustrated by the numerical experiments in Section 6.
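
The nearest-neighbour behaviour for small $\beta$ is easy to see on a path graph. In the Python
sketch below (the graph, the single labelled node and the value of $\beta$ are assumptions for
illustration), the first-order approximation $(I - \beta L) Y_{*k}$ assigns a nonzero score only to
the labelled node and its immediate neighbour, so nodes two or more hops away cannot be
distinguished at this order.

```python
# Path graph 0 - 1 - 2 - 3 - 4 with unit weights; only node 0 is labelled.
n, beta = 5, 0.01
A = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
L = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]
y = [1.0, 0.0, 0.0, 0.0, 0.0]

# First-order approximation F ~ (I - beta L) y of the classification function.
f_approx = [y[i] - beta * sum(L[i][j] * y[j] for j in range(n)) for i in range(n)]
print(f_approx)
```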

Now consider the other case, $\beta \to \infty$. We shall employ the Blackwell series expansion
[ZIISI] for the resolvent operator $(\lambda I + L)^{-1}$ with $\lambda = 1/\beta$:

$$(I + \beta L)^{-1} = \lambda (\lambda I + L)^{-1} = \lambda \left( \frac{1}{\lambda} \frac{1}{N} \mathbf{1} \mathbf{1}^T + H - \lambda H^2 + \dots \right), \tag{16}$$

where $H = (L + \frac{1}{N} \mathbf{1} \mathbf{1}^T)^{-1} - \frac{1}{N} \mathbf{1} \mathbf{1}^T$ is
the generalized (group) inverse of the Laplacian. Since the first term in (16) gives the same value
for all classes if $|V_k| = |V_l|$, $k \neq l$ (which is typically the case), the classification
will depend on the entries of the matrix $H$ and finally, of the matrix
$(L + \frac{1}{N} \mathbf{1} \mathbf{1}^T)^{-1}$. Note that the matrix
$(L + \alpha \mathbf{1} \mathbf{1}^T)^{-1}$, with a sufficiently small positive $\alpha$,
determines a proximity measure called accessibility via dense forests. Its properties are listed
in [15, Proposition 10]. An interpretation of $H$ in terms of spanning forests can be found in
[15, Theorem 3]; see also |2(i|.

The accessibility via dense forests violates a natural monotonicity condition, as distinct from
$(I + \beta L)^{-1}$ with a finite $\beta$. Thus, a better performance of the regularized Laplacian
proximity measure with finite values of $\beta$ can be expected.

For the sake of comparison, let us analyse the limiting behaviour of the heat kernels. For
instance, let us consider the Standard Laplacian heat kernel (6), since it is also based on the
Standard Laplacian. In fact, it is immediate to see that the Standard Laplacian heat kernel has
the same asymptotics as the Regularized Laplacian kernel. Namely, if $t \to 0$,

$$H(t) = \exp(-tL) = I - tL + o(t).$$

Similar expressions hold for the other heat kernels. Thus, for small values of $t$, the
semi-supervised learning methods based on heat kernels should behave as the nearest neighbour
method.

Next consider the Standard Laplacian heat kernel when $t \to \infty$. Recall that the Laplacian
$L = D - A$ is a positive semidefinite symmetric matrix. Without loss of generality, we can denote
and rearrange the eigenvalues of the Laplacian as $0 = \lambda_1 < \lambda_2 \le \dots \le \lambda_N$
and the corresponding orthonormal eigenvectors as $u_1, \dots, u_N$. Note that
$u_1 = \mathbf{1}/\sqrt{N}$. Thus, we can write

$$H(t) = u_1 u_1^T + \sum_{i=2}^{N} \exp(-\lambda_i t) u_i u_i^T.$$

We can see that for large values of $t$ the first term in the above expression is non-informative,
as in the case of the Regularized Laplacian method, and we need to look at the second order term.
However, in contrast to the Regularized Laplacian kernel, the second order term
$\exp(-\lambda_2 t) u_2 u_2^T$ is a rank-one term and cannot in principle give correct
classification in the case of more than two classes. The second term of the Regularized Laplacian
kernel, $H$, is not a rank-one matrix and, as mentioned above, can be interpreted in terms of
proximity measures.
6 Numerical methods and examples

Let us first discuss various approaches for the numerical computation of the classification
functions (2). Broadly speaking, the approaches can be divided into linear algebra methods and
optimization methods. One of the basic linear algebra methods is the power iteration method.
Similarly to the power iteration method described in [0], we can write

$$F_{*k} = (I + \beta D - \beta A)^{-1} Y_{*k},$$

$$F_{*k} = (I - \beta (I + \beta D)^{-1} A)^{-1} (I + \beta D)^{-1} Y_{*k},$$

$$F_{*k} = (I - \beta (I + \beta D)^{-1} D D^{-1} A)^{-1} (I + \beta D)^{-1} Y_{*k}.$$

Now denoting $B := \beta (I + \beta D)^{-1} D$ and $C := (I + \beta D)^{-1}$, we can propose the
following power iteration method to compute the classification functions

$$F^{(s+1)}_{*k} = B D^{-1} A F^{(s)}_{*k} + C Y_{*k}, \quad s = 0, 1, \dots, \tag{17}$$

with $F^{(0)}_{*k} = Y_{*k}$. Since $B$ is a diagonal matrix with the diagonal entries less than
one, the matrix $B D^{-1} A$ is substochastic with the spectral radius less than one and the power
iterations (17) are convergent. However, for large values of $\beta$ and $d_i$, the matrix
$B D^{-1} A$ can be very close to stochastic and hence the convergence rate of the power iterations
can be very slow. Therefore, unless the value of $\beta$ is small, we recommend using other methods
from numerical linear algebra for the solution of linear systems with symmetric matrices (recall
that $L$ is a symmetric positive semidefinite matrix in the case of undirected graphs). In
particular, we tried the Cholesky decomposition method and the conjugate gradient method. Both
methods appeared to be very efficient for problems with tens of thousands of variables. Actually,
the conjugate gradient method can also be viewed as an optimization method for the respective
convex quadratic optimization problem such as (1) and (7). A very convenient property of the
optimization formulations (1) and (7) is that the objective, and consequently the gradient, can
be written in terms of a sum over the edges of the underlying graph. This allows a very simple
(and with some software packages even automatic) parallelization of the optimization methods
based on the gradient. For instance, we have used the parallel implementation of the gradient
based methods provided by the NVIDIA CUDA sparse matrix library (cuSPARSE) (3^ and it
showed excellent performance.
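
A minimal sketch of the power iteration (17) in plain Python is given below; it is not the
implementation used in the experiments. The toy graph, the labelling function, the value of
$\beta$, the stopping tolerance and the function name `power_iteration` are assumptions. Since $B$
and $C$ are diagonal, each step costs one multiplication by $A$; the loop stops when successive
iterates differ by less than the tolerance, and the residual of $(I + \beta L) F_{*k} = Y_{*k}$ is
reported as a check.

```python
def power_iteration(A, y, beta, tol=1e-12, max_iter=10_000):
    """Iterate F <- B D^{-1} A F + C y, B = beta (I + beta D)^{-1} D, C = (I + beta D)^{-1}."""
    n = len(A)
    d = [sum(row) for row in A]
    c = [1.0 / (1.0 + beta * d[i]) for i in range(n)]  # diagonal of C
    # B D^{-1} = beta (I + beta D)^{-1}, hence (B D^{-1} A F)_i = beta * c_i * (A F)_i.
    f = list(y)
    for step in range(1, max_iter + 1):
        af = [sum(A[i][j] * f[j] for j in range(n)) for i in range(n)]
        f_new = [beta * c[i] * af[i] + c[i] * y[i] for i in range(n)]
        if max(abs(a - b) for a, b in zip(f_new, f)) < tol:
            return f_new, step
        f = f_new
    return f, max_iter


# Toy graph (an assumption): two triangles joined by one edge, node 0 labelled.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, beta = 6, 1.0
A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0
y = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

f, steps = power_iteration(A, y, beta)

# Residual of (I + beta L) f = y with L = D - A.
d = [sum(row) for row in A]
residual = max(
    abs(f[i] + beta * (d[i] * f[i] - sum(A[i][j] * f[j] for j in range(n))) - y[i])
    for i in range(n)
)
print(steps, residual)
```

On this graph the iteration matrix has row sums $\beta d_i / (1 + \beta d_i) \le 3/4$, so the
error shrinks by at least a quarter per step; with larger $\beta$ or larger degrees the
contraction factor approaches one, which is the slow-convergence regime described above.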

Let us now illustrate the Regularized Laplacian method and compare it with some other
state of the art semi-supervised learning methods on two datasets: Les Miserables and Wikipedia
Mathematical Articles.

The first dataset represents the network of interactions between major characters in the novel
Les Miserables. If two characters participate in one or more scenes, there is a link between these
two characters. We consider the links to be unweighted and undirected. The network of the
interactions of Les Miserables characters has been compiled by Knuth [27]. There are 77 nodes
and 508 edges in the graph. Using the betweenness based algorithm of Newman and Girvan
m we obtain 6 clusters which can be identified with the main characters: Valjean (17), Myriel
(10), Gavroche (18), Cosette (10), Thenardier (12), Fantine (10), where in brackets we give the
number of nodes in the respective cluster. First, we generate randomly (100 times) labeled points
(two labeled points per class). In Figure 1 we plot average precision as a function of parameter
$\beta$. In lUI^ it was observed that the PageRank based semi-supervised method (obtained by
taking $\sigma = 0$ in (7)) is the only method among a large family of semi-supervised methods
which is robust to the choice of the labelled data ISlIlISj. Thus, we compare the Regularized
Laplacian method with the PageRank based method. As we can see from Figure 1(a), the performance
of the Regularized Laplacian method is comparable to that of the PageRank based method on the
Les Miserables dataset. The horizontal line in Figure 1(a) corresponds to the PageRank based
method with the best choice of the regularization parameter, or the restart probability in the
context of PageRank. Since the Regularized Laplacian method is based on the graph Laplacian, we
also compare it in Figure 1(b) with the three heat kernel methods derived from variations of
the graph Laplacian. Specifically, we consider the three time-domain kernels based on various
Laplacians: Standard Heat kernel (6), Normalized Heat kernel (3), and PageRank Heat kernel (4).
For instance, in the case of the Standard Heat kernel the classification functions are given
by $F_{*k} = H(t) Y_{*k}$. It turns out that all the three time-domain heat kernels are very
sensitive to the value of the chosen time, $t$. Even though there are parameter settings that give
similar performances of the heat kernel methods and the Regularized Laplacian method, the
Regularized Laplacian method has a large plateau of values of $\beta$ where good performance is
assured. Thus, the Regularized Laplacian method is more robust with respect to the parameter
setting than the heat kernel methods.

Figure 1: Les Miserables Dataset. Labelled points are chosen randomly.

To better see the behaviour of the heat kernel methods for large values of $t$, we have chosen
a larger interval for $t$ in Figure 2. The performance of the heat kernel methods degrades quite
significantly for large values of $t$. This is actually predicted by the asymptotics given in
Section 5. Since we have more than two classes, the heat kernels with rank-one second order
asymptotics are not able to distinguish among the classes. All heat kernel methods as well as the
Regularized Laplacian method show a deterioration in performance for small values of $t$ and
$\beta$. This was predicted in Section 5, as all the methods start to behave like the nearest
neighbour method. In particular, as follows from the asymptotics of Section 5 and can be observed
in the figures, the Standard Laplacian heat kernel method and the Regularized Laplacian method
show exactly the same performance when $t \to 0$ and $\beta \to 0$.

It was observed in [S] that taking labelled data points with large (weighted) degree is typically
beneficial for the semi-supervised learning methods. Thus, we now label randomly two points out
of the three points with maximal degree for each class. The average precision is given in
Figure 3(a). We also test heat kernel based methods with the same labelled points, see
Figure 3(b). One can see that if we choose the labelled points with large degree, the Regularized
Laplacian method outperforms the PageRank based method. Some heat kernel based methods with large
degree labelled points also outperform the PageRank based method, but their performance is much
less stable with respect to the value of parameter $t$.
Figure 2: Les Miserables Dataset. Heat Kernel methods vs PR method, larger t.

Figure 3: Les Miserables Dataset. Labelled points are chosen with large degrees.

Figure 4: Wiki Math Dataset. Labelled points are chosen randomly. (a) RL Method vs PR method;
(b) Heat Kernel methods vs PR method.


Next, we consider the second dataset, consisting of Wikipedia mathematical articles. This
dataset is derived from the English language Wikipedia snapshot (dump) from January 30,
2010¹. The similarity graph is constructed by a slight modification of the hyper-text graph. Each
Wikipedia article typically contains links to other Wikipedia articles which are used to explain
specific terms and concepts. Thus, Wikipedia forms a graph whose nodes represent articles
and whose edges represent hyper-text inter-article links. The links to special pages (categories,
portals, etc.) have been ignored. In the present experiment we did not use the information
about the direction of links, so the similarity graph in our experiments is undirected. Then
we have built a subgraph of mathematics related articles, a list of which was obtained from
the “List of mathematics articles” page from the same dump. In the present experiments we have
chosen the following three mathematical classes: “Discrete mathematics” (DM), “Mathematical
analysis” (MA), “Applied mathematics” (AM). With the help of the AMS MSC Classification² and
experts we have classified the related Wikipedia mathematical articles into the three above mentioned
classes. As a result, we obtained three imbalanced classes: DM (106), MA (368) and AM (435).
The subgraph induced by these three topics is connected and contains 909 articles. Then, the
similarity matrix A is just the adjacency matrix of this subgraph.
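
The construction of the similarity matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy article titles are hypothetical, links are given as (source, target) pairs, and self-links are dropped, which the text does not specify. Links whose endpoint is outside the chosen list of articles (for example a category page) are ignored, and the direction of links is discarded by symmetrizing.

```python
import numpy as np
from collections import deque

def undirected_adjacency(links, nodes):
    """Adjacency matrix of the undirected subgraph induced by `nodes`."""
    index = {name: i for i, name in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for src, dst in links:
        # keep only links between selected articles; drop self-links (assumption)
        if src in index and dst in index and src != dst:
            i, j = index[src], index[dst]
            A[i, j] = A[j, i] = 1.0  # forget the link direction
    return A

def is_connected(A):
    """Breadth-first search check that the graph of A is connected."""
    n = A.shape[0]
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in np.flatnonzero(A[i]):
            j = int(j)
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == n

# Toy usage with hypothetical article titles.
nodes = ["Graph theory", "Combinatorics", "Fourier analysis"]
links = [("Graph theory", "Combinatorics"),
         ("Fourier analysis", "Graph theory"),
         ("Graph theory", "Category:Mathematics")]  # ignored: not in `nodes`
A = undirected_adjacency(links, nodes)
print(A, is_connected(A))
```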

First, we have chosen 5 labelled nodes for each class uniformly at random, repeating the
draw 100 times. The average precisions corresponding to the Regularized Laplacian method and
the PageRank based method are plotted in Figure 4(a). We also provide the results for the three
heat kernel based methods in Figure 4(b). As one can see, the results for the Wikipedia
mathematical articles dataset are consistent with the results for the Les Miserables dataset.
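
The averaging protocol can be sketched as follows. This is a minimal illustration, not the experimental code: it assumes that the classification functions of the Regularized Laplacian method are the columns of $F = (I + \beta L)^{-1} Y$ with $L = D - A$, that each node is assigned to the class with the largest value, that classes are coded as integers $0, \dots, K-1$, and that precision means the fraction of all nodes classified correctly. The function names and the two-clique toy graph are hypothetical.

```python
import numpy as np

def regularized_laplacian_classify(A, labelled, n_classes, beta):
    """Assign every node to the class with the largest classification function."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    Y = np.zeros((n, n_classes))
    for node, c in labelled.items():
        Y[node, c] = 1.0  # indicator of the labelled points of class c
    F = np.linalg.solve(np.eye(n) + beta * L, Y)
    return F.argmax(axis=1)

def average_precision(A, classes, n_labels, beta, n_trials=100, seed=0):
    """Mean fraction of correctly classified nodes over random label draws."""
    rng = np.random.default_rng(seed)
    n_classes = int(classes.max()) + 1
    scores = []
    for _ in range(n_trials):
        labelled = {}
        for c in range(n_classes):
            members = np.flatnonzero(classes == c)
            for node in rng.choice(members, size=n_labels, replace=False):
                labelled[int(node)] = c
        predicted = regularized_laplacian_classify(A, labelled, n_classes, beta)
        scores.append(np.mean(predicted == classes))
    return float(np.mean(scores))

def two_cliques(k):
    """Toy graph: two k-cliques joined by a single edge (illustration only)."""
    A = np.zeros((2 * k, 2 * k))
    A[:k, :k] = 1.0
    A[k:, k:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[k - 1, k] = A[k, k - 1] = 1.0
    return A

A = two_cliques(5)
classes = np.array([0] * 5 + [1] * 5)
print(average_precision(A, classes, n_labels=1, beta=1.0))
```

Sweeping `beta` over a grid and plotting the returned value would give a curve of the same kind as those in the figures, here for the toy graph only.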

Then, for each class, we choose 5 points out of the 10 data points with the largest degrees and
average the results. The average precisions for the Regularized Laplacian method, the PageRank
based method and the three heat kernel based methods are plotted in Figure 5. The results are
again consistent with the corresponding results for the Les Miserables dataset. We would like to
mention that, for the Wiki Math dataset, the computations with many parameter settings and
extensive averaging were noticeably faster using the NVIDIA CUDA sparse matrix library
(cuSPARSE) [39] than using numpy.linalg.solve, which calls the LAPACK routine _gesv.
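
To illustrate why sparse routines can help, here is a minimal sketch that solves $(I + \beta L) f = y$ without ever forming a dense matrix. It is not the cuSPARSE code used for the experiments: the conjugate gradient iteration is a substituted technique, chosen because $I + \beta L$ is symmetric positive definite, and the edge-list representation and function names are assumptions. Each iteration costs one matrix-vector product, which is proportional to the number of edges rather than to the square of the number of nodes.

```python
import numpy as np

def apply_system(x, edges, degrees, beta):
    """Return (I + beta * L) x using only the edge list, with L = D - A."""
    src, dst = edges[:, 0], edges[:, 1]
    Ax = np.zeros_like(x)
    np.add.at(Ax, src, x[dst])  # contribution of each edge in one direction
    np.add.at(Ax, dst, x[src])  # and in the other (undirected graph)
    return x + beta * (degrees * x - Ax)

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Plain conjugate gradient for a symmetric positive definite operator."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy usage: 4 nodes, unweighted edges, one labelled point at node 0.
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0], [0, 2]])
degrees = np.bincount(edges.ravel(), minlength=4).astype(float)
y = np.array([1.0, 0.0, 0.0, 0.0])
f = conjugate_gradient(lambda v: apply_system(v, edges, degrees, 2.0), y)
print(f)
```

On a small graph the result can be compared against numpy.linalg.solve applied to the dense matrix; the two should agree up to the chosen tolerance.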

Figure 5: Wiki Math Dataset. Labelled points are chosen with large degree.

Finally, we would like to recall from Subsection 4.5 that a good value of $\beta$ can be provided
by the ratio $\sigma_1^2/\sigma_2^2$, where $\sigma_1^2$ is the variance related to the data points and $\sigma_2^2$ is the variance
related to the paired comparison between points. We can argue that $\sigma_1^2$ is naturally large, whereas
the paired comparisons between points can be performed with much more certainty, and hence
$\sigma_2^2$ is small. This gives a statistical explanation of why it is good to take relatively large values for
the parameter $\beta$ in the Regularized Laplacian method.

¹ http://download.wikimedia.org/enwiki/20100130
² http://www.ams.org/mathscinet/msc/msc2010.html

7 Conclusions 

We have studied in detail the semi-supervised learning method based on the Regularized
Laplacian. The method admits both linear algebraic and optimization formulations. The optimization
formulation appears to be particularly well suited for parallel implementation. We have
provided various interpretations and proximity-distance properties of the Regularized Laplacian
graph kernel. We have also shown that the method is related to the Scheffé linear statistical
model. The method was tested and compared with other state-of-the-art semi-supervised
learning methods on two datasets. The results from the two datasets are consistent. In
particular, we can conclude that the Regularized Laplacian method is comparable in performance with
the PageRank based method and outperforms the related heat kernel based methods in terms of
robustness.

Several interesting research directions remain open for investigation. It would be interesting to
compare the Regularized Laplacian method with the other semi-supervised methods on a very
large dataset. We are currently working in this direction. We observe that there is a large
plateau of $\beta$ values for which the Regularized Laplacian method performs very well. It would be
very useful to characterize this plateau analytically. Also, it would be interesting to understand
analytically why the Regularized Laplacian method performs better when labelled points
with large degree are chosen.